Among European soccer competitions, England’s Premier League is the most watched league globally. This is due in part to its highly competitive nature: any of the 20 teams can beat any other in a given week. That unpredictability makes the Fantasy Premier League (FPL) equally exciting, since it’s not always easy to predict the outcome of a match. The aim of this report is to explore how FPL data can be leveraged to give FPL players (also known as managers) a competitive advantage.
library(tidyverse)
library(knitr)
library(plotly)
library(DT)
The selected data set is sourced from a GitHub repository maintained by FPL enthusiasts. It contains weekly performance data for each player in the game.
fpl.df <- read.csv(
  "https://raw.githubusercontent.com/vaastav/Fantasy-Premier-League/master/data/2021-22/gws/merged_gw.csv",
  encoding = 'UTF-8')
#If import from Git fails or prose below doesn't fit visualizations, un-comment line below
#fpl.df <- read.csv("merged_gw.csv", encoding = 'UTF-8')
colnames(fpl.df)
Each row in the data set reflects a player’s current data for that given game week. This includes the player’s stats, current value on the transfer market and details of their respective fixture.
Here is a full list of the variables contained in the data set and their definitions:
After reading in the data, I noticed some variables, such as “round” and “element”, whose meaning I was unsure of. The data set didn’t come with a data dictionary, although I’m confident I defined the other variables correctly. My next step was therefore to exclude these unknowns from the data, along with other features that are likely irrelevant to player analysis.
The opponent team variable contained numeric IDs, so its values needed to be matched with the corresponding team names. Moreover, the raw data stores player values (prices) in tenths of a million (for example, 63 for a 6.3m player), so the value variable had to be divided by 10 to match the prices shown in the game.
Lastly, the data set was missing the selected % variable, so I derived it as well. This variable is the percentage of managers who selected a player during a particular game week, calculated here against an assumed total of 9 million managers. I also checked for missing values and verified that the data is complete, with no NAs.
# Drop columns
fpl.tbl <- select(fpl.df, -c(kickoff_time, element, fixture, round))
# Change factor level names
fpl.tbl <- fpl.tbl %>%
  mutate(opponent_team = recode_factor(opponent_team,
    `1` = "Arsenal",
    `2` = "Aston Villa",
    `3` = "Brentford",
    `4` = "Brighton",
    `5` = "Burnley",
    `6` = "Chelsea",
    `7` = "Crystal Palace",
    `8` = "Everton",
    `9` = "Leeds Utd",
    `10` = "Leicester",
    `11` = "Liverpool",
    `12` = "Man City",
    `13` = "Man Utd",
    `14` = "Newcastle",
    `15` = "Norwich",
    `16` = "Southampton",
    `17` = "Spurs",
    `18` = "Watford",
    `19` = "West Ham",
    `20` = "Wolves"))
# Change player value to match values in game
fpl.tbl <- fpl.tbl %>%
  mutate(value = value / 10)
# Create selected % variable (9m players)
fpl.tbl <- fpl.tbl %>%
  mutate(selected_pc = round(100 * (selected / 9000000), 2))
# Check for missing values
sum(is.na(fpl.df))
## [1] 0
| name | position | team | value | total_points | goals_scored | assists | selected_pc | clean_sheets | GW |
|---|---|---|---|---|---|---|---|---|---|
| Dejan Kulusevski | MID | Spurs | 6.3 | 16 | 2 | 0 | 10.80 | 1 | 38 |
| Cédric Soares | DEF | Arsenal | 4.2 | 14 | 1 | 1 | 0.43 | 0 | 38 |
| Ilkay Gündogan | MID | Man City | 7.1 | 14 | 2 | 0 | 2.61 | 0 | 38 |
| Ayoze Pérez | MID | Leicester | 5.7 | 13 | 2 | 0 | 0.12 | 0 | 38 |
| Callum Wilson | FWD | Newcastle | 7.1 | 13 | 2 | 0 | 1.60 | 0 | 38 |
| James Maddison | MID | Leicester | 6.9 | 13 | 1 | 1 | 16.89 | 0 | 38 |
| Pascal Groß | MID | Brighton | 5.6 | 13 | 1 | 1 | 0.27 | 0 | 38 |
| Gabriel Teodoro Martinelli Silva | MID | Arsenal | 5.3 | 12 | 1 | 1 | 3.17 | 0 | 38 |
| Heung-Min Son | MID | Spurs | 11.2 | 12 | 2 | 0 | 36.96 | 1 | 38 |
| Danny Welbeck | FWD | Brighton | 6.0 | 11 | 1 | 1 | 1.77 | 0 | 38 |
| Rodrigo Bentancur | MID | Spurs | 4.9 | 11 | 0 | 2 | 0.09 | 1 | 38 |
| Harry Kane | FWD | Spurs | 12.5 | 10 | 1 | 1 | 24.41 | 1 | 38 |
| Jack Harrison | MID | Leeds | 5.5 | 10 | 1 | 0 | 1.56 | 0 | 38 |
| Jamie Vardy | FWD | Leicester | 10.3 | 10 | 1 | 1 | 12.58 | 0 | 38 |
| Vicente Guaita | GK | Crystal Palace | 4.6 | 10 | 0 | 0 | 5.61 | 1 | 38 |
This is a list of the players who scored 10 or more points in the last game week, which FPL managers consider an excellent return. At first glance, there is considerable variation in player values. These values represent the price of the player for a given game week and fluctuate throughout the season based on how many managers have transferred that player in or out. In basic terms: the more popular a player, the higher the price a manager has to pay to transfer them in.
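The code that produced the table above is not shown in the report. As a minimal sketch, assuming the cleaned data has `GW` and `total_points` columns like `fpl.tbl`, the filter could look like the following. The `toy` data frame and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for fpl.tbl; the rows and values are made up for illustration.
toy <- tibble(
  name         = c("Player A", "Player B", "Player C", "Player A", "Player B"),
  position     = c("MID", "DEF", "FWD", "MID", "DEF"),
  GW           = c(37, 37, 37, 38, 38),
  total_points = c(2, 12, 6, 16, 9)
)

# Keep only the most recent game week and players with 10 or more points,
# then sort from highest to lowest score.
top_last_gw <- toy %>%
  filter(GW == max(GW), total_points >= 10) %>%
  arrange(desc(total_points))

top_last_gw
```

Using `max(GW)` rather than a hard-coded 38 keeps the filter valid if the data is refreshed mid-season.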
At the start of each season, managers choose a squad of 15 players from a pool of 590 players. Managers can select up to 3 players from any one team and must stay within a budget of 100m. A team’s overall rank is determined by the total number of points its squad accumulates over the course of the season. Naturally, most managers choose the best players from the top teams in the league for their reliability in delivering points every week. However, this also means that managers are likely to have very similar teams, so some will gamble on transferring in alternative players who can outscore the mainstream picks. This is what FPL managers call a differential.
For example, last game week Cédric Soares delivered excellent returns with 1 goal and 1 assist (and no clean sheet) for a total of 14 points. Priced at 4.2m and selected by only 0.43% of managers, he was a good differential. Investing early in low-value, high-potential players is therefore a popular strategy among FPL managers: everyone is always looking for the next hidden gem.
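One way to make the idea of a differential concrete is to filter on ownership. The sketch below uses a few rows copied from the game week 38 table above; the 5% ownership cut-off (`max_ownership`) and the `points_per_million` column are assumptions made for illustration, not official FPL definitions.

```r
library(dplyr)

# A few rows copied from the game week 38 table above.
gw38 <- tibble(
  name         = c("Dejan Kulusevski", "Cédric Soares",
                   "Heung-Min Son", "Rodrigo Bentancur"),
  value        = c(6.3, 4.2, 11.2, 4.9),
  total_points = c(16, 14, 12, 11),
  selected_pc  = c(10.80, 0.43, 36.96, 0.09)
)

max_ownership <- 5  # hypothetical cut-off (%), chosen for illustration only

# Low-ownership players who still returned 10+ points, ranked by value for money
differentials <- gw38 %>%
  filter(selected_pc < max_ownership, total_points >= 10) %>%
  mutate(points_per_million = round(total_points / value, 2)) %>%
  arrange(desc(points_per_million))

differentials
```

With these rows, only the two players owned by fewer than 5% of managers remain, and the cheaper of the two ranks first on points per million.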
Taking that into consideration, it would be interesting to explore the relationship between the total points scored by a player and other variables to see if there are any useful indicators.
Let’s continue with the Cédric Soares example and compare him against other players in the same position.
NOTE: Rescheduled fixtures can sometimes lead to double game weeks, where certain teams play two matches instead of one in a single game week. Adding double game week players to the squad is a popular strategy among FPL managers, since these players have the potential for a higher points output.
For the purpose of comparison, I omitted players who played fewer than 45 minutes in a game week to get a fair average of the total points in the DEF position. Last game week, Cédric Soares outperformed the average by 11.4 points, and his total of 14 points is 10.8x his own season average of roughly 1.3 points per game week. Given his current form, he might be a cheap differential for managers to consider adding to their squads.
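The comparison described above can be sketched as follows. This assumes the position average is taken over the last game week only and that the player's own average covers every game week in the data; the report does not show the original calculation, and the `toy` values are made up.

```r
library(dplyr)

# Toy stand-in for fpl.tbl; values are made up for illustration.
toy <- tibble(
  name         = c("Def A", "Def A", "Def B", "Def B", "Def C", "Mid A"),
  position     = c("DEF", "DEF", "DEF", "DEF", "DEF", "MID"),
  GW           = c(37, 38, 37, 38, 38, 38),
  minutes      = c(90, 90, 90, 30, 90, 90),
  total_points = c(2, 14, 1, 1, 3, 8)
)

player  <- "Def A"
last_gw <- max(toy$GW)

# Average DEF score in the last game week, excluding appearances under 45 minutes
def_avg <- toy %>%
  filter(position == "DEF", GW == last_gw, minutes >= 45) %>%
  summarise(avg = mean(total_points)) %>%
  pull(avg)

# The player's score in the last game week and their average over all game weeks
player_last <- toy %>%
  filter(name == player, GW == last_gw) %>%
  pull(total_points)

player_season_avg <- toy %>%
  filter(name == player) %>%
  summarise(avg = mean(total_points)) %>%
  pull(avg)

c(vs_position    = player_last - def_avg,
  vs_own_average = player_last / player_season_avg)
```

The first number is the gap to the position average and the second is the multiple of the player's own average, which correspond to the 11.4 points and 10.8x figures quoted above.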
Expanding on this, I was also curious about how total points are spread across the different player positions. This would give insight into how a manager might allocate their budget when planning their squad selections.
The box plots above reveal some useful insights. Goalkeepers have the highest median of the four positions, while the outfield positions have broadly comparable medians. This makes sense with respect to the game, because in most matches an outfield player earns only appearance points (2 points for playing at least 60 minutes) and does not score any additional points.
I also noticed that the box plots are all positively (right) skewed, some more than others. For instance, half of the forwards scored between 1 and 36.5 points this season, while the other half scored between 36.5 and 110. The upper half is spread over a much wider range, which indicates that most forwards score far below the position’s top performers. An FPL manager would therefore be encouraged to focus their budget on other positions, such as midfielders and defenders.
What interested me the most was the box plot for defenders, which has the highest median (37) of all outfield positions. This is likely due to there being two different types of defenders in soccer: center backs and full backs. In most teams, center backs typically remain in a defensive position throughout the match, earning their managers points through clean sheets (not conceding goals) and by avoiding yellow or red cards. Meanwhile, full backs are likely to join attacks and can earn points not only for clean sheets, but for assists and goals too.
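The code behind the box plots is not shown. A minimal sketch, assuming season totals are obtained by summing `total_points` per player before plotting, might look like this. The `toy` data and the object names are illustrative; comparing the median with the mean per position is a quick numerical check on the direction of the skew.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for fpl.tbl: two game weeks per player, made-up values.
toy <- tibble(
  name = rep(c("GK A", "Def A", "Def B", "Mid A",
               "Fwd A", "Fwd B", "Fwd C"), each = 2),
  position = rep(c("GK", "DEF", "DEF", "MID",
                   "FWD", "FWD", "FWD"), each = 2),
  total_points = c(3, 6, 2, 8, 1, 2, 5, 10, 2, 2, 3, 3, 15, 14)
)

# One row per player with their season total
season_totals <- toy %>%
  group_by(name, position) %>%
  summarise(season_points = sum(total_points), .groups = "drop")

# A mean above the median points to a long upper tail (right skew)
position_summary <- season_totals %>%
  group_by(position) %>%
  summarise(median_points = median(season_points),
            mean_points   = mean(season_points))

p <- ggplot(season_totals, aes(x = position, y = season_points)) +
  geom_boxplot() +
  labs(x = "Position", y = "Season total points")

position_summary
p
```

Setting `.groups = "drop"` in `summarise()` also silences the grouping message that dplyr otherwise prints.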
I was also interested in how FPL managers shift their budget around week to week. The following visualization plots player value against the number of managers that transferred that player into their team. I examined the last 6 game weeks to see if I could spot any patterns.
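A plot of this kind could be built roughly as follows. This is a sketch on made-up data, assuming columns named `GW`, `value` and `transfers_in` as in the data set; the report's actual plotting code is not shown.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for fpl.tbl: three players over eight game weeks, made-up values.
toy <- tibble(
  name         = rep(c("Player A", "Player B", "Player C"), times = 8),
  GW           = rep(1:8, each = 3),
  value        = rep(c(5.0, 7.5, 12.0), times = 8),
  transfers_in = seq(1000, by = 500, length.out = 24)
)

# Keep the six most recent game weeks
last_six <- toy %>%
  filter(GW > max(GW) - 6)

# One panel per game week: price on the x-axis, transfers in on the y-axis
p <- ggplot(last_six, aes(x = value, y = transfers_in)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ GW) +
  labs(x = "Player value (m)", y = "Transfers in")

p
```

Since the report loads plotly, passing the result to `plotly::ggplotly(p)` would make the points interactive, which helps when identifying individual outliers.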
There are a few outliers. For example, Bukayo Saka was a very popular pick in GW 26 with 784,691 transfers in. He is not visible in any of the previous game weeks’ plots, so we can infer that the high number of transfers is due to favorable fixtures rather than form; the table below lists two fixtures for him in GW 26, which points to a double game week. By comparison, Philippe Coutinho was heavily transferred in during GW 23 and again in GW 25, the second time with a price increase. The movement of his data point is more likely explained by form, as the table below suggests. But perhaps there are other variables at play here that influence a manager’s decision to bring a player in.
| GW | name | position | goals_scored | assists | clean_sheets | total_points | bonus |
|---|---|---|---|---|---|---|---|
| 24 | Bukayo Saka | MID | 0 | 0 | 1 | 3 | 0 |
| 26 | Bukayo Saka | MID | 1 | 0 | 0 | 10 | 3 |
| 26 | Bukayo Saka | MID | 0 | 0 | 0 | 1 | 0 |
| 28 | Bukayo Saka | MID | 1 | 1 | 0 | 12 | 2 |
| 29 | Bukayo Saka | MID | 0 | 0 | 1 | 3 | 0 |
| 29 | Bukayo Saka | MID | 0 | 0 | 0 | 2 | 0 |
| 24 | Philippe Coutinho Correia | MID | 1 | 2 | 0 | 16 | 3 |
| 25 | Philippe Coutinho Correia | MID | 0 | 0 | 0 | 2 | 0 |
| 26 | Philippe Coutinho Correia | MID | 0 | 0 | 0 | 2 | 0 |
| 27 | Philippe Coutinho Correia | MID | 0 | 0 | 1 | 3 | 0 |
| 28 | Philippe Coutinho Correia | MID | 1 | 1 | 1 | 13 | 2 |
| 28 | Philippe Coutinho Correia | MID | 1 | 0 | 1 | 10 | 2 |
| 29 | Philippe Coutinho Correia | MID | 0 | 0 | 0 | 2 | 0 |
To get a better idea of the importance of the variables in the data, I ran a linear regression analysis. I set ‘total_points’ as the target variable and all other variables as predictors. I then removed one variable that wasn’t statistically significant and refit the model, repeating these steps until all remaining variables were statistically significant at the 5% level (i.e. each predictor’s p-value is less than 0.05).
options(scipen=4)
fpl.lm <- lm(
  total_points ~ . - name - team - opponent_team - position - was_home -
    selected - team_h_score - selected_pc - transfers_in - transfers_out -
    GW - team_a_score - value,
  data = fpl.tbl)
fpl.lm.summary <- coef(summary(fpl.lm))
#Show only statistically significant factors and sort Estimate in descending order
#fpl.lm.summary <- fpl.lm.summary[fpl.lm.summary[,"Pr(>|t|)"]<0.05,]
e <- order(fpl.lm.summary[,1], decreasing = TRUE)
fpl.lm.summary <- fpl.lm.summary[e, ]
fpl.lm.summary
## Estimate Std. Error t value Pr(>|t|)
## penalties_saved 3.7888468691774 0.16041583119964 23.618909 5.657739e-122
## goals_scored 3.4270758019791 0.03043731477475 112.594551 0.000000e+00
## assists 2.2383777209749 0.02066903193792 108.296205 0.000000e+00
## clean_sheets 1.5579311487699 0.01809074308013 86.117587 0.000000e+00
## bonus 1.0419603413301 0.01086005199363 95.944323 0.000000e+00
## saves 0.1403258735019 0.00669983204987 20.944685 1.468411e-96
## threat 0.1218659461928 0.01008339108003 12.085810 1.563242e-33
## creativity 0.1103086240994 0.01010126962663 10.920273 1.067991e-27
## influence 0.0983089132623 0.01016851051323 9.667976 4.515919e-22
## bps 0.0924229176121 0.00117227953748 78.840340 0.000000e+00
## xP 0.0521672312848 0.00215830649523 24.170446 1.410983e-127
## (Intercept) 0.0423973988642 0.00423108771537 10.020449 1.375831e-23
## minutes 0.0147129046894 0.00023819170568 61.769173 0.000000e+00
## transfers_balance -0.0000004306755 0.00000005142704 -8.374494 5.838785e-17
## goals_conceded -0.2245241034423 0.00609997012514 -36.807410 9.806353e-289
## yellow_cards -0.6503881768877 0.01640671036578 -39.641596 0.000000e+00
## ict_index -1.1874840262636 0.10080735136901 -11.779736 6.053077e-32
## penalties_missed -1.2566537619603 0.12421869279614 -10.116463 5.204862e-24
## own_goals -2.0007856083361 0.09427678031735 -21.222464 4.602776e-99
## red_cards -2.1170552958805 0.08430300598103 -25.112453 1.976054e-137
After running the regression analysis, I excluded variables that weren’t statistically significant at the 5% level. The output above ranks the remaining coefficients by the size of their estimated effect on the total points a player scores in a given game week. Penalties saved tops the list as the most influential factor, with an average of 3.79 points earned for every penalty saved. However, this stat applies only to goalkeepers, so it shouldn’t be considered when analyzing outfield players.
Goals scored, assists and clean sheets are also key variables that influence total points, but this is expected since the game awards the most points for these actions. For instance, holding all other variables constant, every goal scored earns 3.43 points on average. Although the model fits, the results are not very useful because they largely restate the game’s scoring rules.
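The variable-removal procedure described earlier (drop a non-significant predictor, refit, repeat) was carried out by hand in this report. A minimal automated sketch is shown below; `backward_eliminate` is a hypothetical helper written for illustration, it assumes all predictors are numeric, and it runs on simulated data rather than the FPL data.

```r
# Simulated data: two real signals and two pure-noise predictors.
set.seed(42)
n <- 200
toy <- data.frame(
  goals   = rpois(n, 0.3),
  assists = rpois(n, 0.3),
  noise_a = rnorm(n),
  noise_b = rnorm(n)
)
toy$points <- 2 + 4 * toy$goals + 3 * toy$assists + rnorm(n, sd = 0.5)

# Hypothetical helper: repeatedly drop the predictor with the largest p-value
# until every remaining predictor is significant at `alpha`.
# Assumes numeric predictors, so coefficient names match column names.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    p_values <- coef(summary(fit))[, "Pr(>|t|)"]
    p_values <- p_values[names(p_values) != "(Intercept)"]
    worst <- which.max(p_values)
    if (p_values[worst] < alpha || length(predictors) == 1) break
    predictors <- setdiff(predictors, names(worst))
  }
  fit
}

final_fit <- backward_eliminate(toy, response = "points")
coef(summary(final_fit))
```

Selecting variables purely on p-values is a simple heuristic that mirrors the manual steps in this report; it is not the only option, and criteria such as AIC (for example via `step()`) are common alternatives.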
After fitting the linear regression model, I created diagnostic plots to confirm that the model is appropriate for the data set. In the Residuals vs Fitted plot, the smoothed line deviates slightly but is roughly horizontal, which suggests the relationship is approximately linear.
Looking at the Q-Q (Quantile-Quantile) plot, we find that the distribution is over-dispersed. This indicates that there is a high number of outliers and that the tails of the distribution are fatter than those of a normal distribution. That said, we could already sense from the box plots that the data is not normally distributed.
With respect to the Scale-Location plot, the smoothed line is roughly horizontal and there is no visible pattern among the residuals. This indicates that the variance of the residuals is roughly equal across fitted values, which supports the model’s fit.
Finally, the Residuals vs Leverage plot tells us that there are no overly influential points in the data set. Observation 4500 is the closest to the Cook’s distance boundary, but it does not cross the point at which it would be considered influential.
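The four diagnostic plots discussed above are the defaults produced by calling `plot()` on an `lm` object. The sketch below shows this on a simulated model (`toy.lm`, a made-up name); applying the same calls to `fpl.lm` should reproduce the plots described in this report.

```r
# Simulated data with a simple linear relationship
set.seed(1)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 1 + 2 * toy$x + rnorm(100)

toy.lm <- lm(y ~ x, data = toy)

# Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage
# drawn in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(toy.lm)
par(old_par)

# Numerical companion to the Residuals vs Leverage panel:
# the observations with the largest Cook's distance
cooks <- cooks.distance(toy.lm)
head(sort(cooks, decreasing = TRUE), 3)
```

Sorting Cook's distances gives a quick way to name the most influential observations instead of reading them off the plot.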
In conclusion, my analysis suggests that investing in midfielders and attacking defenders is probably the most useful strategy. However, some seasoned FPL managers might argue that this holds only for the season in which this report was produced (2021/22). Although the analysis didn’t uncover any ground-breaking insights, it did expose the limitations of the data set. Running a linear regression with the current set of variables is compromised because total points are calculated directly from the actions that award points, such as goals and assists, so the predictors largely restate the scoring rules.
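The limitation can be illustrated with a small simulation. In the sketch below, points are generated from a made-up scoring rule (5 per goal, 3 per assist, which is a simplification and not the full FPL rules) plus a little noise, and the regression simply recovers that rule.

```r
# Simulated player-game-week data
set.seed(7)
n <- 500
toy <- data.frame(
  goals   = rpois(n, 0.2),
  assists = rpois(n, 0.2)
)

# Hypothetical scoring rule for illustration only: 5 per goal, 3 per assist
toy$points <- 5 * toy$goals + 3 * toy$assists + rnorm(n, sd = 0.1)

fit <- lm(points ~ goals + assists, data = toy)

round(coef(fit), 2)      # estimates land close to the scoring rule
summary(fit)$r.squared   # near 1, because the response is built from the predictors
```

A near-perfect fit of this kind says little about which players will score in future, which is why underlying performance stats are likely to be more informative predictors.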
If I were to repeat this study in the future, I would pair this data with a secondary data set consisting of players’ underlying stats. This could include features such as the number of interceptions a defender makes per game or the number of touches a forward takes in the final third. These variables are more closely tied to player performance, so a more granular approach would likely yield more useful results.